DNN: A Distributed NameNode Filesystem for Hadoop

Author

  • Ziling Huang
Abstract

The Hadoop Distributed File System (HDFS) is the distributed storage infrastructure for the Hadoop big-data analytics ecosystem. A single node, called the NameNode, stores the metadata of the entire file system and coordinates the file content placement and retrieval actions of the data storage subsystems, called DataNodes. However, the single-NameNode architecture has long been viewed as the Achilles' heel of HDFS, as it not only represents a single point of failure but also limits the scalability of the storage tier in the system stack. Since Hadoop is now being deployed at increasing scale, this concern has become more prominent. Various solutions have been proposed to address this issue, but they focus primarily on improving availability and pay little or no attention to the important issue of scalability. In this thesis, we first present a brief study of the state-of-the-art solutions to the problem, assessing proposals from both industry and academia. Based on our observation that most metadata operations in Hadoop workloads tend to access files directly rather than exploit locality, we argue that HDFS should have a flat namespace instead of the hierarchical one used in traditional POSIX-based file systems. We propose a novel distributed NameNode architecture based on the flat namespace that improves both the availability and the scalability of HDFS, using the well-established hashing namespace partitioning approach that most existing solutions avoid because it sacrifices the namespace hierarchy. We also evaluate the enhanced architecture on a Hadoop cluster, applying both a micro metadata benchmark and the standard Hadoop macro benchmark.

Acknowledgment

I would like to thank my advisor, Dr. Hong Jiang, for his invaluable guidance and support over the past few years. He led me into this wonderful research area and taught me how to do research and how important critical thinking is. I would like to thank Dr. David Swanson and Dr. Ying Lu for serving as my committee members and reviewing my thesis. I would like to thank all the members of the ADSL lab for their support and suggestions. I would like to express my special thanks to my friend Lei Xu for his help, guidance, and friendship over the years. I thank all my collaborators on the Advanced Development & Architecture team at NetApp Inc., as this work started when I was a summer intern on the team. …
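
As a rough illustration of the hashing-based namespace partitioning mentioned in the abstract, the sketch below hashes each full file path to pick one of several NameNodes, so a direct lookup touches exactly one node. It is not the thesis's implementation: the names `namenode_for` and `FlatNamespace`, the choice of SHA-256, and the in-memory dictionaries standing in for NameNodes are all assumptions made for illustration.

```python
import hashlib


def namenode_for(path: str, num_namenodes: int) -> int:
    """Pick a NameNode index by hashing the full path (hypothetical helper)."""
    if num_namenodes <= 0:
        raise ValueError("num_namenodes must be positive")
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_namenodes


class FlatNamespace:
    """Toy flat namespace: one dict per simulated NameNode (an assumption)."""

    def __init__(self, num_namenodes: int) -> None:
        self.shards = [dict() for _ in range(num_namenodes)]

    def create(self, path: str, metadata: dict) -> None:
        # The whole path is the key; no directory tree is maintained.
        self.shards[namenode_for(path, len(self.shards))][path] = metadata

    def lookup(self, path: str):
        # Direct access: exactly one shard is consulted.
        return self.shards[namenode_for(path, len(self.shards))].get(path)

    def list_prefix(self, prefix: str) -> list:
        # In this sketch, a directory-style listing has to scan every shard,
        # because paths under one prefix are scattered by the hash.
        return sorted(
            p for shard in self.shards for p in shard if p.startswith(prefix)
        )


if __name__ == "__main__":
    ns = FlatNamespace(num_namenodes=4)
    ns.create("/user/a/part-0", {"size": 1})
    ns.create("/user/a/part-1", {"size": 2})
    print(ns.lookup("/user/a/part-0"))
    print(ns.list_prefix("/user/a/"))
```

In this toy version, direct lookups stay cheap while prefix listings fan out to all shards, which is one way to picture the trade-off between a flat, hashed namespace and a hierarchical one.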

Similar resources

The Recovery System for Hadoop Cluster

Due to the brisk growth of data volume in many organizations, large-scale data processing has become a demanding topic for industry as well as academia. Hadoop is widely adopted in cloud computing environments for unstructured data. Hadoop is an open-source, Java-based distributed computing framework that supports large-scale distributed data processing. In recent years, Hadoop Distribu...

High Scalability of HDFS using Distributed Namespace

In data intensive computing, Hadoop is widely used by organizations. The client applications of Hadoop require high availability and scalability of the system. Mostly, these applications are online and their data growth rate is unpredictable. The present Hadoop relies on secondary namenode for failover which slows down the performance of the system. Hadoop system’s scalability depends on the ve...

Sustainability of Hadoop Clusters

Hadoop is a set of utilities and frameworks for the development and storage of distributed applications in cloud computing, the core component of which is the Hadoop Distributed File System (HDFS). NameNode is a key element of its architecture, and also its “single point of failure”. To address this issue, we propose a replication mechanism that will protect the NameNode data in case of failure...

HEBR: A High Efficiency Block Reporting Scheme for HDFS

The Hadoop platform is widely used for managing, analyzing, and transforming large data sets in various systems. The two basic components of Hadoop are 1) a distributed file system (HDFS) and 2) a computation framework (MapReduce). HDFS stores data on simple commodity machines that run DataNode processes (DataNodes). A commodity machine running the NameNode process (NameNode) maintains meta data informat...

NameNode and DataNode Coupling for a Power-Proportional Hadoop Distributed File System

Current works on power-proportional distributed file systems have not considered the cost of updating data sets that were modified (updated or appended) in a low-power mode, where a subset of nodes were powered off. Effectively reflecting the updated data is vital in making a distributed file system, such as the Hadoop Distributed File System (HDFS), power proportional. This paper presents a no...


Publication date: 2016